Sheng and LaGleur used RNA-seq. They aligned their data using STAR to hg19 (it appears ENSEMBL, judging from the gene names given), yielding Integer counts. They did differential expression analysis using edgeR.
The RNA-seq methods description from their methods
Raw reads were aligned to hg19 or mm9 using STAR (v2.4.2a) with the parameter ‘‘quantMode’’ set as ‘‘GeneCounts’’ to produce count table for each gene. Differential gene analysis was performed on gene raw counts in R with edgeR package (v3.18.1) from bio- conductor. Read count table was filtered so that genes with at least one count across conditions were kept. The negative binomial generalized log-linear model was used in differential analysis. A FDR cut-off of 0.05 was used to determine significantly differentially expressed genes. The R package gProfileR (v0.6.4) was used to perform gene enrichment analysis on differential genes. Gene set enrichment analysis (GSEA) was performed with R package clusterProfiler (v3.4.4).
The function analyzeRepeats.pl from Homer (http://homer.ucsd.edu/homer/) software was used to get raw counts for repeats from RNA-seq data. Differential expression for repeats was performed with edgeR the same way as for genes.
This notebook depicts the re-analysis of human data
install.packages("readxl")
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/readxl_1.1.0.tgz'
Content type 'application/x-gzip' length 1452194 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
The downloaded binary packages are in
/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T//RtmptzsYmT/downloaded_packages
library(readxl)
package 'readxl' was built under R version 3.4.4
source("https://bioconductor.org/biocLite.R")
Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
A new version of Bioconductor is available after installing the most recent version of R; see http://bioconductor.org/install
biocLite("Biobase")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'Biobase'
trying URL 'https://bioconductor.org/packages/3.6/bioc/bin/macosx/el-capitan/contrib/3.4/Biobase_2.38.0.tgz'
Content type 'application/x-gzip' length 2157954 bytes (2.1 MB)
==================================================
downloaded 2.1 MB
The downloaded binary packages are in
/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T//RtmptzsYmT/downloaded_packages
biocLite("GEOquery")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'GEOquery'
trying URL 'https://bioconductor.org/packages/3.6/bioc/bin/macosx/el-capitan/contrib/3.4/GEOquery_2.46.15.tgz'
Content type 'application/x-gzip' length 13775074 bytes (13.1 MB)
==================================================
downloaded 13.1 MB
The downloaded binary packages are in
/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T//RtmptzsYmT/downloaded_packages
biocLite("limma")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'limma'
trying URL 'https://bioconductor.org/packages/3.6/bioc/bin/macosx/el-capitan/contrib/3.4/limma_3.34.9.tgz'
Content type 'application/x-gzip' length 2139937 bytes (2.0 MB)
==================================================
downloaded 2.0 MB
The downloaded binary packages are in
/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T//RtmptzsYmT/downloaded_packages
biocLite("edgeR")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'edgeR'
trying URL 'https://bioconductor.org/packages/3.6/bioc/bin/macosx/el-capitan/contrib/3.4/edgeR_3.20.9.tgz'
Content type 'application/x-gzip' length 2025192 bytes (1.9 MB)
==================================================
downloaded 1.9 MB
The downloaded binary packages are in
/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T//RtmptzsYmT/downloaded_packages
biocLite("gProfileR")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'gProfileR'
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/gProfileR_0.6.6.tgz'
Content type 'application/x-gzip' length 27898 bytes (27 KB)
==================================================
downloaded 27 KB
The downloaded binary packages are in
/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T//RtmptzsYmT/downloaded_packages
biocLite("GSVA")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'GSVA'
trying URL 'https://bioconductor.org/packages/3.6/bioc/bin/macosx/el-capitan/contrib/3.4/GSVA_1.26.0.tgz'
Content type 'application/x-gzip' length 1520135 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
The downloaded binary packages are in
/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T//RtmptzsYmT/downloaded_packages
biocLite("GO.db")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'GO.db'
installing the source package 'GO.db'
trying URL 'https://bioconductor.org/packages/3.6/data/annotation/src/contrib/GO.db_3.5.0.tar.gz'
Content type 'application/x-gzip' length 31663705 bytes (30.2 MB)
==================================================
downloaded 30.2 MB
* installing *source* package ‘GO.db’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (GO.db)
The downloaded source packages are in
'/private/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T/RtmptzsYmT/downloaded_packages'
biocLite("org.Hs.eg.db")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'org.Hs.eg.db'
installing the source package 'org.Hs.eg.db'
trying URL 'https://bioconductor.org/packages/3.6/data/annotation/src/contrib/org.Hs.eg.db_3.5.0.tar.gz'
Content type 'application/x-gzip' length 75219292 bytes (71.7 MB)
==================================================
downloaded 71.7 MB
* installing *source* package ‘org.Hs.eg.db’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (org.Hs.eg.db)
The downloaded source packages are in
'/private/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T/RtmptzsYmT/downloaded_packages'
biocLite("org.Mm.eg.db")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'org.Mm.eg.db'
installing the source package 'org.Mm.eg.db'
trying URL 'https://bioconductor.org/packages/3.6/data/annotation/src/contrib/org.Mm.eg.db_3.5.0.tar.gz'
Content type 'application/x-gzip' length 66556127 bytes (63.5 MB)
==================================================
downloaded 63.5 MB
* installing *source* package ‘org.Mm.eg.db’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (org.Mm.eg.db)
The downloaded source packages are in
'/private/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T/RtmptzsYmT/downloaded_packages'
source("../../pipelines/gsva_script.R")
biocLite("EnsDb.Hsapiens.v86")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.3 (2017-11-30).
Installing package(s) 'EnsDb.Hsapiens.v86'
installing the source package 'EnsDb.Hsapiens.v86'
trying URL 'https://bioconductor.org/packages/3.6/data/annotation/src/contrib/EnsDb.Hsapiens.v86_2.99.0.tar.gz'
Content type 'application/x-gzip' length 78178992 bytes (74.6 MB)
==================================================
downloaded 74.6 MB
* installing *source* package ‘EnsDb.Hsapiens.v86’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (EnsDb.Hsapiens.v86)
The downloaded source packages are in
'/private/var/folders/vv/r_1ps31s2pl1pcpzjtmz4chw0000gn/T/RtmptzsYmT/downloaded_packages'
library("Biobase")
library("GEOquery")
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
library("limma")
library("edgeR")
library("gProfileR")
package 'gProfileR' was built under R version 3.4.4
library("GSVA")
library("GO.db")
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: IRanges
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:base':
expand.grid
library("org.Hs.eg.db")
library("org.Mm.eg.db")
library("EnsDb.Hsapiens.v86")
Loading required package: ensembldb
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicFeatures
Loading required package: AnnotationFilter
Attaching package: 'ensembldb'
The following object is masked from 'package:stats':
filter
human_rnaseq <- read.csv(file = "../../datasets/sheng_lafleur/GSE112230_2018-03-14_rnaseq_hs_counts.csv",header = TRUE,row.names = 1,sep = ",")
human_erv <- read.csv(file = "../../datasets/sheng_lafleur/GSE112230_2018-03-22_repeat_count_hs.csv",header = TRUE,row.names = 1,sep = ",")
#human_rnaseq_design <- data.frame()
human_rnaseq_design <- data.frame(cell = factor(rep("MCF7",ncol(human_rnaseq))))
rownames(human_rnaseq_design) <- colnames(human_rnaseq)
human_rnaseq_design$treatment <- factor(c(rep("shC",2),rep("shLSD1",2)))
human_rnaseq_design$rep <- factor(c(1,2,1,2))
print(human_rnaseq_design)
# model matrix
human_rnaseq_model_matrix <- model.matrix(~ treatment,human_rnaseq_design)
print(human_rnaseq_model_matrix)
(Intercept) treatmentshLSD1
MCF7_ribo.RNA_sh.C_rep1ReadsPerGene.out.tab 1 0
MCF7_ribo.RNA_sh.C_rep2ReadsPerGene.out.tab 1 0
MCF7_ribo.RNA_sh.LSD1_rep1ReadsPerGene.out.tab 1 1
MCF7_ribo.RNA_sh.LSD1_rep2ReadsPerGene.out.tab 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$treatment
[1] "contr.treatment"
# treat the sh-Control as the "baseline", so it gets an intercept term
set DGElist object
human_rnaseq_dge <- DGEList(counts=human_rnaseq,
genes = data.frame(genes = rownames(human_rnaseq))) # set DGE list
human_rnaseq_dge <- calcNormFactors(object = human_rnaseq_dge) # set normalization factors
human_rnaseq_dge <- estimateDisp(y = human_rnaseq_dge,
design = human_rnaseq_model_matrix) # set dispersion factors
plotBCV(y = human_rnaseq_dge)
# human_rnaseq_dge_exactTest <- exactTest(human_rnaseq_dge)
fit GLM
human_rnaseq_dge_fit <- glmQLFit(human_rnaseq_dge, human_rnaseq_model_matrix)
Perform tests
logFC_edgeR_lim <- 1
pval_cutoff <- 0.05
n_toptag <- 10000
# coef = 1 measures the baseline genes
human_rnaseq_dge_fit_coeff1_qlft <- glmQLFTest(glmfit = human_rnaseq_dge_fit,coef = 1)
# coef = 2 measures the fold-changes between the control and the fold change
human_rnaseq_dge_fit_coeff2_qlft <- glmQLFTest(glmfit = human_rnaseq_dge_fit,
coef = 2,
poisson.bound = TRUE)
# coef = 2, but we use a LRT.
human_rnaseq_dge_fit_coeff2_lrt <- glmLRT(glmfit = human_rnaseq_dge_fit,
coef = 2)
# coeff = 2 with "glmTreat". This option performs a fold-change cutoff
human_rnaseq_dge_fit_treat <- glmTreat(glmfit = human_rnaseq_dge_fit,
coef=2,
lfc=logFC_edgeR_lim)
# topTags; show top 10
human_rnaseq_dge_fit_coeff1_qlft_top <- topTags(human_rnaseq_dge_fit_coeff1_qlft)
human_rnaseq_dge_fit_coeff2_qlft_top <- topTags(human_rnaseq_dge_fit_coeff2_qlft,
p.value = pval_cutoff,
n = n_toptag,
adjust.method = "BH",sort.by = "logFC")
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim <- human_rnaseq_dge_fit_coeff2_qlft_top
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$table <- subset(human_rnaseq_dge_fit_coeff2_qlft_top$table,abs(logFC) > logFC_edgeR_lim)
# human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$table <- subset(human_rnaseq_dge_fit_coeff2_qlft_top$table,logFC > logFC_edgeR_lim)
human_rnaseq_dge_fit_coeff2_lrt_top <- topTags(object = human_rnaseq_dge_fit_coeff2_lrt,
p.value = pval_cutoff,
n = n_toptag,
adjust.method = "BH",sort.by = "PValue")
human_rnaseq_dge_fit_coeff2_treat_top <- topTags(object = human_rnaseq_dge_fit_treat,
p.value = pval_cutoff,
adjust.method = "BH",
n = n_toptag,sort.by = "logFC")
# plot the fold-changes-vs-pvalue and the top genes by the LRT
plot(human_rnaseq_dge_fit_coeff2_qlft$table$logFC,
-log10(human_rnaseq_dge_fit_coeff2_qlft$table$PValue),
pch=20)
points(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$table$logFC,
-log10(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$table$PValue),
col="red",pch=20)
# plotMD
plotMD(object = human_rnaseq_dge_fit_coeff2_qlft,adjust.method = "BH",p.value = pval_cutoff)
We must convert the ENSEMBL names to gene symbols
# get ENSEMBL annotation
ensembl_annot <- read.table(file = "../../datasets/transcripts/Homo_sapiens.GRCh38.91.names.txt",header = FALSE,sep = "\t")
ensembl_annot$V1 <- sapply(ensembl_annot$V1,
function(x) {
a <- unlist(strsplit(x = as.character(x),split = " "))
a[2]
})
ensembl_annot$V2 <- sapply(ensembl_annot$V2,
function(x) {
a <- unlist(strsplit(x = as.character(x),split = " "))
a[2]
})
colnames(ensembl_annot) <- c("ensid","symbol")
# obtain the symbols of the most differentially-expressed genes
# ens_names <- as.character(human_rnaseq_dge_fit_coeff2_lrt_top$table$genes)
ens_names <- as.character(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$table$genes)
# split the given names
ens_names_split <- sapply(ens_names,
function(x) {
a <- unlist(strsplit(x = as.character(x),split = ".",fixed = TRUE))
a[1]
})
Match names
# library(dplyr)
matched_names <- sapply(ens_names_split,
function(x) {
# get the gene name
symbol <- unlist(subset(ensembl_annot,ensid %in% x)[,2])
# symbol <- unlist(subset(ensembl_annot,ensid %in% x)[,2])
symbol
})
# some of these genes may not have been identified in our annotation, so we will omit them for now
matched_names_is_annotated <- sapply(matched_names,
function(x) {
length(x) > 0})
matched_names_annotated <- matched_names[matched_names_is_annotated]
matched_names_annotated_df <- as.data.frame(do.call(rbind,matched_names_annotated))
colnames(matched_names_annotated_df) <- "symbol"
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$dex_symbol <- matched_names_annotated_df #data.frame(symbols = matched_names)
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$dex_ensg_symbol <- paste(rownames(matched_names_annotated_df),matched_names_annotated_df$symbol,sep = "_")
Isolate the counts for the genes of interest
temp_counts <- human_rnaseq_dge_fit$counts
temp_counts <- temp_counts[rownames(temp_counts) %in% rownames(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$dex_symbol),]
temp_counts_names <- sapply(rownames(temp_counts),
function (x) {
a.ind <- grep(pattern = x,x = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$dex_ensg_symbol)
a <- human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$dex_ensg_symbol[a.ind]
a
})
rownames(temp_counts) <- unlist(temp_counts_names)
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$counts <- temp_counts
pheatmap(log2(temp_counts+1),
show_rownames = FALSE,
annotation_col = human_rnaseq_design,
show_colnames = FALSE)
Some helpful code notes
vvv3@login04:~/pillaifolder/rmarkdown/tfh_diffn$ grep -s "gsva" *.Rmd
ranking_enhancerCluster_genes.Rmd: "pipelines/transcriptomics_pipeline/gsea/gsva_script.R"),
tfhdiff.Rmd:source("pipelines/transcriptomics_pipeline/gsea/gsva_script.R")
tfhdiff.Rmd:prepare_gsva()
tfhdiff.Rmd:tcell.genesets.tfh.enrich <- gsva_enrich(counts = tcell.enrich.counts$counts.aggregate,geneset = tcell.genesets)
vvv3@login04:~/pillaifolder/rmarkdown/tfh_diffn$ grep -s "tcell.genesets" *.Rmd
tfhdiff.Rmd:tcell.genesets <- import_gmt(gmtfile = "referenceAnnotation/hg38/genesets/msigdb/custom/msigdb_tcell_genesets.gmt")
tfhdiff.Rmd:tcell.genesets.tfh.enrich <- gsva_enrich(counts = tcell.enrich.counts$counts.aggregate,geneset = tcell.genesets)
tfhdiff.Rmd:#colnames(tcell.genesets.tfh.enrich$es.obs) <- c("CD4naive","PD1loCXCR5pos","PD1intCXCR5pos","PD1hiCXCR5hi")
tfhdiff.Rmd:tcell.genesets.tfh.enrich.es.kmeans <- kmeans_intervals(matrix = tcell.genesets.tfh.enrich$es.obs,kmax = 5,B=50)
tfhdiff.Rmd:tcell.genesets.tfh.enrich.es <- as.data.frame(tcell.genesets.tfh.enrich$es.obs)
tfhdiff.Rmd:tcell.genesets.tfh.enrich.es$k.choice <- tcell.genesets.tfh.enrich.es.kmeans$matrix.kmeans$kmeans
tfhdiff.Rmd:length(tcell.genesets.tfh.enrich.es$k.choice)
tfhdiff.Rmd:dim(tcell.genesets.tfh.enrich.es)
tfhdiff.Rmd:tcell.genesets.tfh.enrich.es <- tcell.genesets.tfh.enrich.es[order(tcell.genesets.tfh.enrich.es$k.choice),]
tfhdiff.Rmd:#enrichments <- tcell.genesets.tfh.enrich$es.obs[sort(tcell.genesets.tfh.enrich.es.kmeans$matrix.kmeans$kmeans),]
tfhdiff.Rmd:rowannotation.ssgsea <- data.frame(kchoice = as.factor(sort(tcell.genesets.tfh.enrich.es.kmeans$matrix.kmeans$kmeans)))
tfhdiff.Rmd:#rowannotation.ssgsea <- data.frame(kchoice = as.factor(tcell.genesets.tfh.enrich.es.kmeans$matrix.kmeans$kmeans))
tfhdiff.Rmd:head(tcell.genesets.tfh.enrich.es)
tfhdiff.Rmd:pheatmap(mat = tcell.genesets.tfh.enrich.es[,c(1:(ncol(tcell.genesets.tfh.enrich.es)-1))],
vvv3@login04:~/pillaifolder/rmarkdown/tfh_diffn$
# prepare for GSVA
prepare_gsva()
# import some gene sets for GSVA
kegg_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/msigdb_c2.cp.kegg_hsapiens.gmt")
incomplete final line found on '../../referenceAnnot/genesets/hg38/msigdb_c2.cp.kegg_hsapiens.gmt'
bp_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/msigdb_c5.bp_hsapiens.gmt")
incomplete final line found on '../../referenceAnnot/genesets/hg38/msigdb_c5.bp_hsapiens.gmt'
cancer_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/msigdb_c6_hsapiens.gmt")
incomplete final line found on '../../referenceAnnot/genesets/hg38/msigdb_c6_hsapiens.gmt'
immune_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/msigdb_c7_hsapiens.gmt")
incomplete final line found on '../../referenceAnnot/genesets/hg38/msigdb_c7_hsapiens.gmt'input string 375 is invalid in this localeinput string 376 is invalid in this localeinput string 385 is invalid in this localeinput string 386 is invalid in this localeinput string 387 is invalid in this locale
# from the Li et al oncotarget 2014 paper
interferon_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/custom/msigdb_interferon.gmt")
cytochemokine_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/custom/msigdb_cytochemokine.gmt")
inflammation_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/custom/msigdb_inflammation.gmt")
antigen_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/custom/msigdb_antigen.gmt")
# virus-related terms
virus_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/custom/msigdb_virus.gmt")
# hallmark sets--verified in microarrays
hallmark_genesets <- import_gmt(gmtfile = "../../referenceAnnot/genesets/hg38/msigdb_hallmark_hsapiens.gmt")
incomplete final line found on '../../referenceAnnot/genesets/hg38/msigdb_hallmark_hsapiens.gmt'
# import counts into gsva script
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts <- import_counts(countsfile = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$counts,imported_counts = TRUE)
# perform GSVA with KEGG
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_kegg_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = kegg_genesets)
# perform GSVA with Cancer pathways
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_cancer_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = cancer_genesets)
# perform GSVA with Immune pathways
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_immune_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = immune_genesets)
# perform GSVA with Immune pathways
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_bp_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = bp_genesets)
# perform GSVA with GO terms from Li et al Oncotarget
interferon_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = interferon_genesets)
cytochemokine_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = cytochemokine_genesets)
inflammation_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = inflammation_genesets)
antigen_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = antigen_genesets)
# perform GSVA with viral terms
virus_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = virus_genesets)
# perform GSVA with hallmark terms
hallmark_enrich <- gsva_enrich(counts = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate,
geneset = hallmark_genesets)
Compact naming of pathways
name_pathway <- function(x) {
pathway_name <- sapply(x,
function(y) {
a <- unlist(strsplit(x = as.character(y),split = ":",fixed = TRUE))
a[2]
})
return(pathway_name)
}
Set the terms of the signatures that we want
# set li oncotarget terms
li_oncotarget_terms <- paste("interferon",
"cytokine",
"inflammation",
"influenza",
"testes",
"presentation",
sep = "|")
Plot enrichment
pheatmap(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_kegg_enrich,
annotation_col = human_rnaseq_design,
show_colnames = FALSE,
fontsize = 5.0,
main = "KEGG")
pheatmap(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_cancer_enrich,
annotation_col = human_rnaseq_design,
show_colnames = FALSE,
fontsize = 5.0,
main = "Cancer")
set.seed(0)
kmeans_immune <- kmeans(x = human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_immune_enrich,
centers = 4)
kmeans_immune_annot <- as.data.frame(sort(kmeans_immune$cluster))
colnames(kmeans_immune_annot) <- c("cluster")
human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_immune_enrich <- human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_immune_enrich[rownames(kmeans_immune_annot),]
pheatmap(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_immune_enrich,
annotation_col = human_rnaseq_design,
annotation_row = kmeans_immune_annot,
cluster_rows = FALSE,
show_rownames = FALSE,
show_colnames = FALSE,
fontsize = 5.0,
main = "Immune")
# highlight the immunological GO_BP terms
c5_bp_li_oncotarget <- grep(pattern = li_oncotarget_terms,x = rownames(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_bp_enrich),ignore.case = TRUE)
c5_bp_not_li_oncotarget <- grep(pattern = li_oncotarget_terms,x = rownames(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_bp_enrich),ignore.case = TRUE,invert = TRUE)
c5_bp_annot <- data.frame(li_oncotarget = c(rep("li_onco",length(c5_bp_li_oncotarget)),
rep("non_li_onco",length(c5_bp_not_li_oncotarget))))
rownames(c5_bp_annot) <- c(rownames(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_bp_enrich)[c5_bp_li_oncotarget],
rownames(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_bp_enrich)[c5_bp_not_li_oncotarget])
#c5_bp_annot <- c5_bp_annot[order(c5_bp_annot$),]
c5_bp_sorted <- human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_bp_enrich[rownames(c5_bp_annot),]
pheatmap(c5_bp_sorted,
annotation_col = human_rnaseq_design,
annotation_row = c5_bp_annot,
show_rownames = FALSE,
show_colnames = FALSE,
fontsize = 5.0,
main = "BP")
Enrichments for terms from Li et al 2014 and Chiappinelli et al 2015.
pheatmap(interferon_enrich,
annotation_col = human_rnaseq_design,
labels_row = rownames(interferon_enrich),
show_colnames = FALSE,
cluster_cols = TRUE,
fontsize = 2.5,
main = "Interferon GO Terms")
pheatmap(cytochemokine_enrich,
annotation_col = human_rnaseq_design,
labels_row = rownames(cytochemokine_enrich),
show_colnames = FALSE,
fontsize = 5.0,
cluster_cols = FALSE,
main = "Cytokine/Chemokine GO Terms")
pheatmap(inflammation_enrich,
annotation_col = human_rnaseq_design,
labels_row = rownames(inflammation_enrich),
cluster_rows = FALSE,
cluster_cols = FALSE,
show_colnames = FALSE,
fontsize = 5.0,
main = "Inflammation GO terms")
pheatmap(antigen_enrich,
annotation_col = human_rnaseq_design,
labels_row = rownames(antigen_enrich),
show_rownames = TRUE,
fontsize = 4.0,
show_colnames = FALSE,
cluster_cols = FALSE,
main = "Antigen Presentation and Receptor GO Terms")
pheatmap(virus_enrich,
annotation_col = human_rnaseq_design,
labels_row = rownames(virus_enrich),
show_rownames = TRUE,
show_colnames = FALSE,
cluster_cols = FALSE,
fontsize_row = 2.5,
main = "Virus GO Terms")
Hallmarks
pheatmap(hallmark_enrich,
annotation_col = human_rnaseq_design,
labels_row = rownames(hallmark_enrich),
show_rownames = TRUE,
show_colnames = FALSE,
cluster_cols = FALSE,
fontsize_row = 5.0,
main = "Hallmark GO Terms")
We see some pathway enrichments.
In the Sheng and LaFleur paper, the following procedure was used to get the ERV signature:
For analyzing LSD1 expression level versus IFN signature, genes were extracted from ‘virus’ and ‘interferon’ related terms of the biological pathways enriched in upregulated genes upon LSD1 KD in MCF7 cells. These genes were further filtered for those that are upregulated in KD condition to obtain a refined IFN signature gene list. Gene-wise z-score was calculated in each tumor type and the sum of genes in the refined IFN signature gene list was used to characterize IFN signature. Pearson’s correlation was used to evaluate the correlation of LSD1 level (z-score) and IFN signature
Let’s divide the “interferon” GSVA analysis further.
set.seed(0)
# should do a gap-statistic calculation
kmeans_interferon <- kmeans(x = interferon_enrich,
centers = 4)
kmeans_interferon_annot <- as.data.frame(sort(kmeans_interferon$cluster))
colnames(kmeans_interferon_annot) <- c("cluster")
interferon_enrich <- interferon_enrich[rownames(kmeans_interferon_annot),]
pheatmap(interferon_enrich,
annotation_col = human_rnaseq_design,
annotation_row = kmeans_interferon_annot,
cluster_rows = FALSE,
cluster_cols = FALSE,
show_colnames = FALSE,
fontsize = 3.0,
main = "Interferon")
Specifically, let’s look at the cluster with largely uniform increased expression of the pathways in the shLSD1 cases
chosen_ifn_k <- 1
interferon_enrich_subset <- interferon_enrich[rownames(subset(kmeans_interferon_annot,cluster %in% c(chosen_ifn_k))),]
pheatmap(interferon_enrich_subset,
annotation_col = human_rnaseq_design,
annotation_row = kmeans_interferon_annot,
cluster_rows = FALSE,
show_colnames = FALSE,
fontsize = 5.0,
main = "Interferon, enriched")
Let’s do the same with the viral signatures
set.seed(0)
# should do a gap-statistic calculation
kmeans_viral <- kmeans(x = virus_enrich,
centers = 4)
kmeans_viral_annot <- as.data.frame(sort(kmeans_viral$cluster))
colnames(kmeans_viral_annot) <- c("cluster")
virus_enrich <- virus_enrich[rownames(kmeans_viral_annot),]
pheatmap(virus_enrich,
annotation_col = human_rnaseq_design,
annotation_row = kmeans_viral_annot,
cluster_rows = FALSE,
show_colnames = FALSE,
fontsize = 3.0,
main = "Interferon")
chosen_viral_k <- 4
viral_enrich_subset <- virus_enrich[rownames(subset(kmeans_viral_annot,cluster %in% c(chosen_viral_k))),]
pheatmap(viral_enrich_subset,
annotation_col = human_rnaseq_design,
annotation_row = kmeans_viral_annot,
cluster_rows = FALSE,
show_colnames = FALSE,
fontsize = 5.0,
main = "Viral")
Let’s test if the enrichment differences between one set of samples and another for this cluster are significant. For this, we will bootstrap a set of enrichment values from the samples for each signature and construct a distribution of the differences in mean enrichment.
# virus
random_sig <- apply(X = virus_enrich,
MARGIN = 1,
function(x) {
set.seed(0)
a <- sample(x = x,size = length(x),replace = TRUE)
mean(a[3:4]) - mean(a[1:2])
})
hist(random_sig)
q_cutoffs <- c(mean(random_sig) - sd(random_sig)*qnorm(p = 0.025),
mean(random_sig) + sd(random_sig)*qnorm(p = 0.025))
abline(v = q_cutoffs,
lty=2,
col="red")
# plot each of the other
test_sig <- apply(X = virus_enrich,
MARGIN = 1,
function(x) {
a <- mean(x[3:4]) - mean(x[1:2])
a_col <- "grey"
if (a >= q_cutoffs[1] | a <= q_cutoffs[2]) {
a_col <- "pink"
}
abline(v = a,col = a_col,lty=3)
a >= q_cutoffs[1] | a <= q_cutoffs[2]
})
kmeans_viral_annot$sig_boot <- as.factor(test_sig)
pheatmap(virus_enrich,
annotation_col = human_rnaseq_design,
annotation_row = kmeans_viral_annot,
cluster_rows = FALSE,
show_colnames = FALSE,
fontsize = 3.0,
main = "Virus")
# interferon
random_sig <- apply(X = interferon_enrich,
MARGIN = 1,
function(x) {
set.seed(0)
a <- sample(x = x,size = length(x),replace = TRUE)
mean(a[3:4]) - mean(a[1:2])
})
hist(random_sig)
q_cutoffs <- c(mean(random_sig) - sd(random_sig)*qnorm(p = 0.025),
mean(random_sig) + sd(random_sig)*qnorm(p = 0.025))
abline(v = q_cutoffs,
lty=2,
col="red")
# plot each of the other
test_sig <- apply(X = interferon_enrich,
MARGIN = 1,
function(x) {
a <- mean(x[3:4]) - mean(x[1:2])
a_col <- "grey"
if (a >= q_cutoffs[1] | a <= q_cutoffs[2]) {
a_col <- "pink"
}
abline(v = a,col = a_col,lty=3)
a >= q_cutoffs[1] | a <= q_cutoffs[2]
})
kmeans_interferon_annot$sig_boot <- as.factor(test_sig)
pheatmap(interferon_enrich,
annotation_col = human_rnaseq_design,
annotation_row = kmeans_interferon_annot,
cluster_rows = FALSE,
show_colnames = FALSE,
fontsize = 3.0,
main = "Interferon")
Good; all of the signatures in the desired clusters have significant mean enrichment changes.
Combine the gene sets in the desired clusters
viral_enriched_gos <- subset(kmeans_viral_annot,cluster %in% chosen_viral_k)
ifn_enriched_gos <- subset(kmeans_interferon_annot,cluster %in% chosen_ifn_k)
enriched_sig_gos <- unique(c(rownames(viral_enriched_gos),
rownames(ifn_enriched_gos))) # since there is some overlap
Identify the genes in the enriched, significantly-expressed GO terms
# template dataset
all_viral_ifn_gos <- unique(do.call(c,list(virus_genesets,interferon_genesets)))
names(all_viral_ifn_gos) <- unique(c(names(virus_genesets),names(interferon_genesets)))
all_viral_ifn_gos <- all_viral_ifn_gos[names(all_viral_ifn_gos) %in% enriched_sig_gos]
all_viral_ifn_genes <- unique(unlist(all_viral_ifn_gos))
Obtain the differentially-expressed genes (from the sequencing experiment) that are present in these genes
# get ifn and viral genes
dex_genes <- human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$dex_symbol
ifn_viral_genes <- subset(dex_genes,symbol %in% all_viral_ifn_genes)
# get those genes that are upregulated
# also impose a logCPM count limit of at least 0
upreg <- subset(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$table,
logFC > logFC_edgeR_lim & logCPM >= 0)
upreg <- subset(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$table,
logFC > logFC_edgeR_lim)
upreg_genes <- human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$dex_symbol[rownames(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$dex_symbol) %in% upreg$genes,]
upreg_genes <- unlist(upreg_genes)
ifn_viral_signature <- subset(ifn_viral_genes,symbol %in% upreg_genes)
temp_aggregate_counts <- human_rnaseq_dge_fit_coeff2_qlft_top_lfclim_gsva_counts$counts.aggregate
ifn_viral_signature_counts <- temp_aggregate_counts[rownames(temp_aggregate_counts) %in% ifn_viral_signature$symbol,]
pheatmap(log2(ifn_viral_signature_counts+1),
show_rownames = FALSE,
annotation_col = human_rnaseq_design,
show_colnames = FALSE)
# plot distribution of signature fold changes
ifn_viral_signature_table <- subset(human_rnaseq_dge_fit_coeff2_qlft_top_lfclim$table,genes %in% rownames(ifn_viral_signature))
hist(ifn_viral_signature_table$logCPM)
hist(ifn_viral_signature_table$logFC)
Save this signature
write.table(x = ifn_viral_signature,
file = "ifn_viral_signature_shenglafleur_hs.txt",
quote = FALSE,sep = "\t",row.names = TRUE,col.names = TRUE)
write.table(x = ifn_viral_signature_table,
file = "ifn_viral_signature_table_shenglafleur_hs.txt",
quote = FALSE,sep = "\t",row.names = TRUE,col.names = TRUE)
Which gene signatures are most correlated with ERV expression? We want to come up with a different ERV-activation gene signature for this
# need to debug kegga
golrt <- kegga(de = human_rnaseq_dge_fit_coeff2_qlft,geneid = human_rnaseq_dge_fit_coeff2_qlft$genes$genes)
# de = human_rnaseq_dge_fit_coeff2_lrt,
# species="Hs")
topGO(golrt)
# why do pathways not overlap with the universe?
New from edgeR
human_rnaseq_dge_fit_coeff2_go <- goana(de = human_rnaseq_dge_fit_coeff2, species="Hs")